Introduction

The purpose of this workflow is to analyze the results from 2.1_chromaprobs.Rmd That workflow imputed local chromatin states for each transcript, and scanned those states for effects on transcription in the DO.

This workflow compares the eQTL mapped to chromatin and to haplotypes.

library(here);library(qtl2);library(gprofiler2);library(stringr)
library(msigdbr);library(fgsea);library(pheatmap);library(biomaRt)
library(knitr)
num.states = 9
all.var <- ls()
lib.loaded <- as.logical(length(which(all.var == "mus")))
if(!lib.loaded){
  mus <- useMart("ENSEMBL_MART_ENSEMBL", dataset = "mmusculus_gene_ensembl", 
  host = "may2017.archive.ensembl.org")
}

#att <- listAttributes(mus)
#fil <- listFilters(mus)
#query.term <- "island"
#att[grep(query.term, att[,1]),]
#fil[grep(query.term, fil[,1]),]
is.interactive = FALSE
#is.interactive = TRUE
all.code.dir <- list.files(here("Code"), full.names = TRUE)
for(i in 1:length(all.code.dir)){
    all.fun <- list.files(all.code.dir[i], full.names = TRUE)
    for(j in 1:length(all.fun)){source(all.fun[j])}
}
data(CCcolors)
chrom.colors <- colors.from.values(1:num.states, use.pheatmap.colors = TRUE)

Load data

ch.coef <- readRDS(here("Results", "chQTL", "chQTL.coef.RDS"))
ch.lod <- readRDS(here("Results", "chQTL", "chQTL.lod.RDS"))
ch.r2 <- readRDS(here("Results", "chQTL", "chQTL.R2.RDS"))
e.coef <- readRDS(here("Results", "eQTL", "eQTL.coef.RDS"))
e.lod <- readRDS(here("Results", "eQTL", "eQTL.lod.RDS"))
e.r2 <- readRDS(here("Results", "eQTL", "eQTL.R2.RDS"))

chrom.mats <- readRDS(here("Results", "ChromHMM", paste0(num.states, "_states_C"), 
paste0("Chromatin_States_", num.states, "_full_gene_1000.RData")))
chrom.states <- readRDS(here("Data", "chQTL", "Expanded_Chromatin.RDS"))

do.data <- load(here("Data", "DOQTL", "Svenson_DO850_for_eQTL_viewer_v9.RData"))
expr <- dataset.mrna$data$rz
expr.norm <- dataset.mrna$data$norm
covar <- dataset.mrna$covar.matrix
diet.locale <- lapply(c(0,1), function(x) which(covar[,"diethf"] == x))
transcript.haplotypes <- readRDS(here("Data", "DOQTL", "transcript.haplotypes.RDS"))

strain.key <- read.table(here("Data", "support_files", "strain.color.table.txt"), 
sep = "\t", stringsAsFactors = FALSE, comment.char = "*")
transcript.info.file <- here("Data", "RNASeq", "RNASeq_gene_info.RData")
if(!file.exists(transcript.info.file)){
    all.genes <- unique(names(chrom.states), colnames(expr))
    transcript.info <- getBM(attributes = c("ensembl_gene_id", "external_gene_name", 
    "chromosome_name", "start_position", "end_position", "transcription_start_site", 
    "exon_chrom_start", "exon_chrom_end", "5_utr_start","5_utr_end","3_utr_start", 
    "3_utr_end", "strand"), 
    filters = "ensembl_gene_id", values = all.genes, mart = mus)
    saveRDS(transcript.info, transcript.info.file)
}else{
  transcript.info <- readRDS(transcript.info.file)
}
gene.names <- transcript.info[match(rownames(e.lod), transcript.info[,"ensembl_gene_id"]),"external_gene_name"]

Matching genetic eQTLs with chromatin eQTLs

The following plot shows the LOD scores of genetic eQTLs and chromatin eQTLs plotted against each other. For the most part, the LOD scores agree with each other very well.

However, there is also a fairly large group of genes for which there is a mismatch between the two associations. These are genes for which there is a cis-eQTL, but relatively lower association with the chromatin state. The distribution of these differences is shown in the right-hand panel.

max.ch.lod <- t(sapply(ch.lod, function(x) if(length(x) > 0){apply(x, 2, function(y) max(y, na.rm = TRUE))}else{rep(NA, 4)}))
if(is.interactive){quartz(width = 8, height = 4)}
par(mfrow = c(1,2))
plot(e.lod[,1], max.ch.lod[,1], pch = 16, col = "gray", cex = 0.5,
xlab = "genetic eQTL LOD score", ylab = "chromatin eQTL LOD score",
main = "Genetic vs. Chromatin-based eQTL LOD scores")
abline(0,1, col = "salmon")

lod.diff <- e.lod[,1] - max.ch.lod[,1]
hist(lod.diff, breaks = 100, main = "LOD score differences")

ordered.lod <- sort(lod.diff, decreasing = TRUE)

Genetic and Chromatin R2 Comparison

We also calculated the R2 for each linear model using either haplotype probabilities or chromatin probabilities plus covariates to explain transcript level. This way we can get a sense of how much variance in the expression each type of data is explaining.

The chromatin variance explained and genetics variance explained are very highly correlated. There are many cases in which the chromatin does not explain as much variance as the genetics.

max.ch.r2 <- t(sapply(ch.r2, function(x) if(length(x) == 0){return(rep(NA, 4))}else{apply(x, 2, function(y) max(y, na.rm = TRUE))}))
r2.diff <- e.r2 - max.ch.r2

if(is.interactive){quartz(width = 8, height = 4)}
par(mfrow = c(1,2))
#plot.hexbin(e.r2[,1], max.ch.r2[,1], xbins = 50)
#test <- plot.with.model(e.r2[,1], max.ch.r2[,1], report = "cor.test")

plot(e.r2[,1], max.ch.r2[,1], pch = 16, col = "gray", cex = 0.7,
xlab = "R2 Genetic eQTL", ylab = "R2 Chromatin eQTL", 
main = "Comparison of Chromatin and Genetic R2")
abline(0,1)

large.r2.diff <- get.percentile(r2.diff, 95)
hist(r2.diff[,1], breaks = 100, main = "R2 Difference Distribution", 
xlab = "R2 Difference")
abline(v = large.r2.diff)

We defined genes with a big mismatch between the genetic and chromatin eQTL scores as genes in the top 95% of the difference distribution.

r2diff.order <- apply(r2.diff, 2, function(x) order(x, decreasing = TRUE))
big.diff <- apply(r2.diff, 2, function(x) rownames(r2.diff)[which(x > large.r2.diff)])

This set of 392 genes is highly enriched for RNA and non-coding RNA processing.

diff.enrich <- gost(big.diff[[1]], organism = "mmusculus")
if(is.interactive){quartz(width = 8, height = 4)}
par(mfrow = c(1,2))
plot.enrichment.wordcloud(diff.enrich, max.term.size = 2000)

Positional Chromatin LOD score

We normalized the transcript position to examine whether high LOD scores occurred mostly at the TSS, or at other regions.

It actually appears that there is a dip in average LOD score right at the TSS, and a peak in average LOD score in the initial part of the gene after the TSS.

test.idx <- 1:length(ch.lod)
#test.idx <- big.diff.locale

norm.ch <- vector(mode = "list", length = length(ch.lod[test.idx]))
names(norm.ch) <- rownames(e.lod)[test.idx]
for(x in 1:length(test.idx)){
  if(is.interactive){report.progress(x, length(test.idx))}
  if(!is.null(ch.lod[[x]])){
  norm.ch[[x]] <- center.on.feature(gene.names[x], transcript.info, ch.lod[[x]][,1], 
  feature = "full")
  }
}

if(is.interactive){quartz()}
lod.by.pos <- plot.centered.vals(norm.ch, seq.by = 0.1, min.representation = 10, 
ylim = c(0, 10), plot.label = "Mean LOD Score by Position", 
ylab = "Mean LOD Score", return.means = FALSE)
abline(v = c(0,1))

Many genes have no variation in chromatin state across strains right at the TSS. We excluded these genes and looked at average LOD score by position only for the genes that have variation in chromatin state at the TSS across strains.

For these genes, the highest LOD scores are immediately upstream of the TSS

chrom.gene.names <- transcript.info[match(names(chrom.mats), transcript.info[,"ensembl_gene_id"]),"external_gene_name"]
norm.states <- lapply(1:length(chrom.mats), 
function(x) if(length(chrom.mats[[x]]) > 1){center.on.feature(chrom.gene.names[x], 
transcript.info, chrom.mats[[x]][,1], feature = "full")}else{NA})
tss.locale <- lapply(norm.states, function(x) get.nearest.pt(0, as.numeric(names(x))))
tss.states <- lapply(1:length(tss.locale), function(x) if(length(chrom.mats[[x]]) > 1){chrom.mats[[x]][tss.locale[[x]],]}else{NA})
numeric.states <- sapply(tss.states, function(x) if(length(x) > 1){length(unique(x))}else{NA})
var.locale <- which(numeric.states > 1)
var.genes <- names(chrom.mats)[var.locale]

var.gene.locale <- match(var.genes, names(norm.ch))

if(is.interactive){quartz()}
var.lod.by.pos <- plot.centered.vals(norm.ch[var.gene.locale], 
seq.by = 0.1, min.representation = 10, 
ylim = c(0, 10), plot.label = "Mean LOD Score by Position", 
ylab = "Mean LOD Score", return.means = FALSE)
abline(v = c(0,1))

Plotting these two scores on the same graph shows the dip in LOD score for all genes (black) compared with the peak at the TSS for the genes with variation at that point (blue).

mean.var.lod <- colMeans(var.lod.by.pos, na.rm = TRUE)
mean.all.lod <- colMeans(lod.by.pos, na.rm = TRUE)

if(is.interactive){quartz()}
plot(as.numeric(colnames(var.lod.by.pos)), mean.var.lod, type = "l", col = "blue",
ylab = "Mean LOD score", xlab = "Relative Position", lwd = 3)
points(as.numeric(colnames(lod.by.pos)), mean.all.lod, type = "l", lwd = 3)
abline(v = c(0,1))

Conclusion

Most genes have no variation in chromatin at the TSS, and thus chromatin state at this position is not used for regulating gene expression across strains.

However, a subset of genes do have variation in chromatin state at the TSS, and for these genes, chromatin state at this position is used in regulation of gene expression.

Chromatin state coefficients by position

In the figures below, we show the average coefficients for each state across the gene body. We only include non-zero coefficients. This tells us the effect of each state at each position when it actually varies at that position.

I had done this earlier using all coefficients, and it looked as if state 7 had a dip in effect at the TSS. It does, but only because it is invariant at the TSS across strains for most genes. We should also have a measure of presence and variance across the gene body.

The chromatin states have the same directionality biases that we observed in the inbred strains. For example, states 1 and 2 are negatively associated with expression, and state 7 is positively associated with expression.

We see very similar spatial pattern to that seen in the inbred mice.

State 8 coefficients dip right at the TSS. State 7 coefficients are high everywhere, but take a dip at the TSS. This is a little different from the inbred mice, where the dip seems to occur upstream of the TSS. State 5 coefficients were high in the body of the gene State 4 coefficients were pretty close to 0 everywhere State 3 coefficients took a little dip by the TSS State 1 and state 2 coefficients were low everywhere, especially near the TSS

norm.ch.coef <- lapply(ch.coef, function(x) x[[1]])
names(norm.ch.coef) <- rownames(e.lod)
for(x in 1:length(ch.coef)){
  if(is.interactive){report.progress(x, length(ch.coef))}
  if(!is.null(ch.coef[[x]])){
  norm.pos <- center.on.feature(gene.names[x], transcript.info, ch.coef[[x]][[1]][,1], 
  feature = "full")
  rownames(norm.ch.coef[[x]]) <- names(norm.pos)
  }
}

state.col <- colors.from.values(1:num.states, use.pheatmap.colors = TRUE)
all.state.coef <- vector(mode = "list", length = num.states)
names(all.state.coef) <- 1:num.states
for(s in 1:num.states){
  if(is.interactive){quartz(width = 6, height = 4)}
  cat("### State", s, "\n")
  norm.state <- lapply(norm.ch.coef, function(x) x[,s])
  non.missing.state <- norm.state
  for(i in 1:length(norm.state)){
    zero.locale <- which(non.missing.state[[i]] == 0)
    if(length(zero.locale) > 0){
      non.missing.state[[i]][zero.locale] <- NA
    }
  }
  all.state.coef[[s]] <- plot.centered.vals(non.missing.state, seq.by = 0.1, 
  min.representation = 15, ylim = c(-0.3, 0.3), plot.label = paste("State", s), 
  ylab = "Mean Non-Zero Coefficient", return.means = FALSE, min.upstream = -1, 
  max.downstream = 2)
  abline(h = 0, v = c(0,1))
  cat("\n\n")
}

State 1

State 2

State 3

State 4

State 5

State 6

State 7

State 8

State 9

plot_poly_mat <- function(state.coef, poly.col = "gray", ylim = c(-1,1), 
  xlim = c(-1, 2), plot.label = "State"){

  state.mean <- colMeans(state.coef, na.rm = TRUE)
  state.n <- apply(state.coef, 2, function(x) length(which(!is.na(x))))
  state.sd <- apply(state.coef, 2, function(x) sd(x, na.rm = TRUE))
  state.se <- state.sd/sqrt(state.n)
  state.x <- as.numeric(colnames(state.coef))

  one.state.col <- col2rgb(poly.col)
  
  plot.new()
  plot.window(xlim = xlim, ylim = ylim)
  plot.poly.xy(poly.top.x = state.x, poly.bottom.x = state.x, 
    poly.top.y = state.mean+state.se, poly.bottom.y = state.mean - state.se, 
    col = rgb(one.state.col[1]/256, one.state.col[2]/256, one.state.col[3]/256, alpha = 0.5))
  axis(2, cex.axis = 0.7)
  abline(v = c(0,1), col = "darkgray", lty = 2, lwd = 2)
  abline(h = 0)
  par(xpd = TRUE)
  text(x = 2, y = 0, labels = plot.label, srt = 270, cex = 1.5)
  par(xpd = FALSE)
}

if(is.interactive){quartz(width = 6, height = 7)}
layout.mat <- matrix(c(1:9, 0), ncol = 2, byrow = FALSE)
layout(layout.mat)
par(mar = c(0,2,0,2))
for(i in 1:length(all.state.coef)){
  #quartz(width = 3.5, height = 3)
  plot_poly_mat(all.state.coef[[i]], poly.col = state.col[i], 
  ylim = c(-0.3, 0.35), xlim = c(-1, 2),
  plot.label = paste("State", i))
  mtext(paste(nrow(all.state.coef[[i]]), "genes"), side = 3, line = -2.5)
}

Effect size of chromatin does not depend on position

Do the dips we see above represent dips toward zero or away from zero? Below we show the absolute value of the average coefficient to get a better handle on which states have stronger effects near the TSS, and which states have weaker effects.

The strength of the effect of state 7 takes a dip at the TSS. The strength of the effect of state 1 also take a bit of a dip right at the TSS, but is high on either side. The average effect of state 4 goes up a bit at the TSS. States 2, 3, 5, and 8 are all fairly level across the gene.

all.state.mag <- vector(mode = "list", length = num.states)
names(all.state.mag) <- 1:num.states
for(s in 1:num.states){
  if(is.interactive){quartz(width = 10, height = 4)}
  cat("### State", s, "\n")
  norm.state <- lapply(norm.ch.coef, function(x) x[,s])
  abs.state <- lapply(norm.state, function(x) if(length(x) > 0){abs(x)}else{NA})
  non.missing.state <- abs.state
  for(i in 1:length(norm.state)){
    zero.locale <- which(non.missing.state[[i]] == 0)
    if(length(zero.locale) > 0){
      non.missing.state[[i]][zero.locale] <- NA
    }
  }
  all.state.mag[[s]] <- plot.centered.vals(non.missing.state, seq.by = 0.1, 
  min.representation = 10, ylim = c(0, 0.5), plot.label = paste("State", s), 
  ylab = "Mean Non-Zero Coefficient Magnitude", return.means = FALSE)
  abline(h = 0, v = c(0,1))
  #abline(h = mean(all.state.mag[[s]], na.rm = TRUE))
  cat("\n\n")
}

State 1

State 2

State 3

State 4

State 5

State 6

State 7

State 8

State 9

cat("### Mean State Effect Size\n")

Mean State Effect Size

barplot(sapply(all.state.mag, function(x) mean(x, na.rm = TRUE)), col = chrom.colors,
main = "Mean Effect Magnitude")

cat("\n")

State Variation by Position

The figures below show the proportion of the genes at any given gene body position that have variation in the given state across strains.

What these figures show is that the placement of state 7 is highly consistent right at the TSS because there is a dip in variation right at that point. Variation then increases later in the body of the gene. This is the lowest value seen for any state, suggesting its placement at the TSS is important to transcription.

On the other hand, the variation in state 8 goes up at the TSS, suggesting that it may play a more regulatory role in that position.

Other states are much less conserved relative to the position in the gene body. States 4, 5, 6 have high variation throught the gene body.

State 1 is an interesting one. We know from the analysis in the inbred animals that its abundance is very low near the TSS and then increases as you move toward the TES. However the proportion of genes with variation in that state across strains decreases as you move toward the TSS.

all.state.var <- vector(mode = "list", length = num.states)
names(all.state.var) <- 1:num.states
for(s in 1:num.states){
  if(is.interactive){quartz(width = 10, height = 4)}
  cat("### State", s, "\n")
  norm.state <- lapply(norm.ch.coef, function(x) x[,s])
  non.missing.state <- norm.state
  for(i in 1:length(norm.state)){
    non.zero.locale <- which(non.missing.state[[i]] != 0)
    if(length(non.zero.locale) > 0){
      non.missing.state[[i]][non.zero.locale] <- 1
    }
  }
  all.state.var[[s]] <- plot.centered.vals(non.missing.state, seq.by = 0.01, 
  min.representation = 10, ylim = c(0, 1.1), plot.label = paste("State", s), 
  ylab = "Proportion Genes With Variation in State", return.means = FALSE)
  abline(h = 0, v = c(0,1))
  cat("\n\n")
}

State 1

State 2

State 3

State 4

State 5

State 6

State 7

State 8

State 9